COMPUTING SCIENCE An approach to describing and analysing bulk biological annotation quality: A case study using UniProtKB

نویسندگان

  • Michael J. Bell
  • Colin S. Gillespie
  • Daniel Swan
  • Phillip Lord
  • M. J. Bell
  • C. S. Gillespie
  • D. Swan
  • P. Lord
چکیده

Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid unannotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we look at addressing within this paper. Specifically we investigate word reuse within bulk textual annotations and relate this to Zipf’s Principle of Least Effort. We use UniProtKB as a case study to demonstrate this approach since it allows us to compare annotation change, both over time and between automated and manually-curated annotations. By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality on free text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality. Source code and supplementary data are available at: http://homepages.cs.ncl.ac.uk/m.j.bell1/annotation/ © 2012 Newcastle University. Printed and published by Newcastle University, Computing Science, Claremont Tower, Claremont Road, Newcastle upon Tyne, NE1 7RU, England. Bibliographical details BELL, M.J., GILLESPIE, C.S., SWAN, D., LORD, P. An approach to describing and analysing bulk biological annotation quality: A case study using UniProtKB [By] M. J. Bell, C. S. Gillespie, D. Swan, P. Lord Newcastle upon Tyne: Newcastle University: Computing Science, 2012. (Newcastle University, Computing Science, Technical Report Series, No. CS-TR-1336)

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

MOTIVATION Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually, however manual curation is costly, time consuming and requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to a...

متن کامل

Can Inferred Provenance and Its Visualisation Be Used to Detect Erroneous Annotation? A Case Study Using UniProtKB

A constant influx of new data poses a challenge in keeping the annotation in biological databases current. Most biological databases contain significant quantities of textual annotation, which often contains the richest source of knowledge. Many databases reuse existing knowledge; during the curation process annotations are often propagated between entries. However, this is often not made expli...

متن کامل

HAMAP in 2013, new developments in the protein family classification and annotation system

HAMAP (High-quality Automated and Manual Annotation of Proteins-available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support t...

متن کامل

HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot

The growth in the number of completely sequenced microbial genomes (bacterial and archaeal) has generated a need for a procedure that provides UniProtKB/Swiss-Prot-quality annotation to as many protein sequences as possible. We have devised a semi-automated system, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), that uses manually built annotation templates for prot...

متن کامل

The UniProtKB guide to the human proteome

Advances in high-throughput and advanced technologies allow researchers to routinely perform whole genome and proteome analysis. For this purpose, they need high-quality resources providing comprehensive gene and protein sets for their organisms of interest. Using the example of the human proteome, we will describe the content of a complete proteome in the UniProt Knowledgebase (UniProtKB). We ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012